Bridging the Gap between HPC and Big Data frameworks

نویسندگان

  • Michael J. Anderson
  • Shaden Smith
  • Narayanan Sundaram
  • Mihai Capota
  • Zheguang Zhao
  • Subramanya Dulloor
  • Nadathur Satish
  • Theodore L. Willke
چکیده

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

HPC and Big Data Convergence for Extreme Heterogeneous Systems

As the data deluge grows ever greater, large-scale data analytics workloads are quickly becoming critical computational tools within the scientific community. Recently, convergence efforts have focused on combining aspects HPC and ”big data” analytics workloads together using a unified supercomputing system. This has the opportunity to bring advanced analytical tools to scientists which enable ...

متن کامل

2018-00574 - HPC-Big Data convergence at processing level by bridging in situ/in transit processing with Big Data analytics

This PhD will be done in the context of the Inria Project Lab (IPL) HPC-BigData: High Performance Computing and Big Data. The goal of this IPL is to gather teams from HPC, Big Data and Machine Learning (ML) areas to work at the intersection between these domains. External partners include: ATOS/Bull, Argonne National Lab (ANL), Laboratoire de Biochimie Théoerique (LBT), CNRS, ESI-Group, Grid’5000.

متن کامل

Big Data at HPC Wales

This paper describes an automated approach to handling Big Data workloads on HPC systems. We describe a solution that dynamically creates a unified cluster based on YARN in an HPC Environment, without the need to configure and allocate a dedicated Hadoop cluster. The end user can choose to write the solution in any combination of supported frameworks, a solution that scales seamlessly from a fe...

متن کامل

Causes of the Gap between Junior High School Intended, Implemented, and Attained Curricula and Ways of Bridging It

Causes of the Gap between Junior High School Intended, Implemented, and Attained Curricula and Ways of Bridging It   M.A. Jamaalifar* S. Sh. HaashemiMoghadam, Ph.D.** Z. Aabedi Karajibaan, Ph.D.*** A.R. Faghihi, Ph.D.****   To identify the causes of the perceived gap between junior high school intended, implemented, and attained curricula, a group of 30 curriculum planners, 50 educationa...

متن کامل

Cross border E-Science and Research Partnership: Bridging the Gap Between Science and Media

  E-Science is a tool that helps scientists to store, interpret, analyze and make a network of their data, and it can play a critical role in different aspects of the scientific goals and research. This commentary, under the topic of Cross Border E-Science and Research Partnership: Bridging the Gap between Science and Media,[1] attempts to shed light on E-Science with emphasis on three importa...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • PVLDB

دوره 10  شماره 

صفحات  -

تاریخ انتشار 2017